Automated part-of-speech analysis of Urdu: conceptual and technical issues
Abstract
Part-of-speech (POS) tagging is the process of labelling tokens in a text with tags that indicate their morphosyntactic category. It has a wide range of applications in computational and corpus linguistics, such as the production of corpus-based dictionaries and grammars. This paper describes an experiment in extending POS tagging to a hitherto untagged language, Urdu. The most challenging task in POS tagging is disambiguation, i.e. resolving the contextual ambiguity of a token for which more than one tag is possible. Three important approaches to disambiguation have been developed: approaches based on rules devised by a linguist; probabilistic approaches that apply corpus-derived statistics in a mathematical model such as a Markov model; and the approach of Brill (1995), in which rules are learned automatically from a corpus. However, because only a small amount of pre-tagged data was available for Urdu, only the rule-based approach was appropriate for the Urdu tagger described here. A rule-based tagger for Urdu was created within the Unitag architecture, together with the requisite language-specific resources for Urdu (including a tagset, an analyser, a lexicon, and a rule list). An evaluation of the tagger suggests that it performs at a level of accuracy notably below that commonly reported for languages such as English. However, this poor performance is primarily attributable to the small size of the lexicon, which in turn reflects the small quantity of training data available. The disambiguation rules themselves were more successful.
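To make the rule-based approach concrete, the following is a minimal Python sketch of lexicon lookup followed by hand-written disambiguation rules. It is not the Unitag implementation and does not use the paper's Urdu tagset: the toy English lexicon, the tag names (DET, ADJ, NOUN, VERB, AUX), the open-class fallback for unknown tokens, and the two rules are all assumptions made purely for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: each known token maps to its candidate tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "old": {"ADJ"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
}

# Assumption of this sketch: tokens missing from the lexicon receive
# every open-class tag and are left for the rules to narrow down.
OPEN_CLASS: Set[str] = {"NOUN", "VERB", "ADJ"}


def analyse(tokens: List[str]) -> List[Set[str]]:
    """Look up each token and return its set of candidate tags."""
    return [set(LEXICON.get(tok.lower(), OPEN_CLASS)) for tok in tokens]


def disambiguate(candidates: List[Set[str]]) -> List[Set[str]]:
    """Apply context rules left to right; a rule never empties a tag set."""
    result = [set(c) for c in candidates]
    for i in range(len(result)):
        prev = result[i - 1] if i > 0 else set()

        # Rule 1 (illustrative): directly after an unambiguous determiner
        # or adjective, discard verbal readings.
        if prev in ({"DET"}, {"ADJ"}):
            remaining = result[i] - {"VERB", "AUX"}
            if remaining:
                result[i] = remaining

        # Rule 2 (illustrative): directly after an unambiguous noun,
        # discard the nominal reading.
        if prev == {"NOUN"}:
            remaining = result[i] - {"NOUN"}
            if remaining:
                result[i] = remaining
    return result


if __name__ == "__main__":
    for sentence in (["the", "old", "can", "rusts"], ["the", "zog"]):
        tags = disambiguate(analyse(sentence))
        print(sentence, [sorted(t) for t in tags])
```

In this sketch the first sentence is fully disambiguated, while the unknown token "zog" in the second remains ambiguous between two tags after the rules have run. That is a small-scale picture of the abstract's observation that a limited lexicon, rather than the rules, can be the main source of tagging error.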
Similar resources
Developing a tagset for automated part-of-speech tagging in Urdu
While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Litt...
Developing a tagset for automated part-of-speech tagging in Urdu Andrew
While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has ...
Politeness Orientation in Social Hierarchies in Urdu
The present research investigates how the politeness of Urdu speakers is influenced by their relative social status in society. The researcher took the politeness theory of Brown and Levinson (1978, 1987) as a model. To observe the politeness of Urdu speakers, the speech act of apology, with its different strategies, was selected. A Discourse Completion Task (DCT) was used as an instrument to...
Testing Problems in Russian as a Foreign Language in a Technical University
Problems in the theory and practice of testing Russian as a foreign language for entrants to technical universities are considered. The benefits of test formats for assessing foreign students’ skills in the Russian language under a strict time limit are presented. The structure and content of the tests, all types of tasks offered on the entrance and final examinations in the Russian languag...
Semi-Semantic Part of Speech Annotation and Evaluation
This paper presents semi-semantic part of speech annotation and its evaluation via Krippendorff’s α for the URDU.KON-TB treebank, developed for the South Asian language Urdu. The part of speech annotation, with additional subcategories for morphology and semantics, provides a treebank with sufficient encoded information. The corpus was collected from the Urdu Wikipedia and newspapers. ...